Marks: 60
Buying and selling used phones and tablets used to happen on only a handful of online marketplace sites. But the used and refurbished device market has grown considerably over the past decade, and a new IDC (International Data Corporation) forecast predicts that the used phone market will be worth $52.7bn by 2023, growing at a compound annual growth rate (CAGR) of 13.6% from 2018 to 2023. This growth can be attributed to an uptick in demand for used phones and tablets, which offer considerable savings compared with new models.
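As a quick sanity check of the forecast quoted above, the 13.6% CAGR can be used to back out the implied 2018 market size. This is a rough sketch: only the $52.7bn target and the growth rate come from the text; the 2018 figure is derived.

```python
# Back out the implied 2018 market size from the quoted 2023 forecast and CAGR.
target_2023 = 52.7   # forecast 2023 market size, $bn (IDC figure quoted above)
cagr = 0.136         # compound annual growth rate
years = 2023 - 2018  # five years of compounding
implied_2018 = target_2023 / (1 + cagr) ** years
print(f"Implied 2018 market size: ${implied_2018:.1f}bn")  # about $27.9bn
```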
Refurbished and used devices continue to provide cost-effective alternatives to consumers and businesses looking to save money on a purchase. The market has plenty of other benefits as well: used and refurbished devices can be sold with warranties and can be insured with proof of purchase, and third-party vendors/platforms such as Verizon and Amazon offer attractive deals on refurbished devices. Maximizing device longevity through second-hand trade also reduces environmental impact by promoting recycling and cutting waste. The impact of the COVID-19 outbreak may further boost this segment as consumers cut back on discretionary spending and buy phones and tablets only for immediate needs.
The rising potential of this comparatively under-the-radar market fuels the need for an ML-based solution to develop a dynamic pricing strategy for used and refurbished devices. ReCell, a startup aiming to tap the potential in this market, has hired you as a data scientist. They want you to analyze the data provided and build a linear regression model to predict the price of a used phone/tablet and identify factors that significantly influence it.
The data contains the different attributes of used/refurbished phones and tablets. The detailed data dictionary is given below.
Data Dictionary
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
sns.set()
# To split the data into train and test
from sklearn.model_selection import train_test_split
# To build linear regression_model
import statsmodels.api as sm
# To check model performance
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
# To remove the limit on the number of displayed columns
pd.set_option("display.max_columns", None)
# To set the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# Loading the data set
refurb = pd.read_csv('used_device_data.csv')
# making a copy of the original data set
df = refurb.copy()
df.shape
print(f"There are {df.shape[0]} rows and {df.shape[1]} columns.")
There are 3454 rows and 15 columns.
# checking the first 5 rows
df.head(5)
| brand_name | os | screen_size | 4g | 5g | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | new_price | used_price | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Honor | Android | 14.50 | yes | no | 13.0 | 5.0 | 64.0 | 3.0 | 3020.0 | 146.0 | 2020 | 127 | 111.62 | 74.26 |
| 1 | Honor | Android | 17.30 | yes | yes | 13.0 | 16.0 | 128.0 | 8.0 | 4300.0 | 213.0 | 2020 | 325 | 249.39 | 174.53 |
| 2 | Honor | Android | 16.69 | yes | yes | 13.0 | 8.0 | 128.0 | 8.0 | 4200.0 | 213.0 | 2020 | 162 | 359.47 | 165.85 |
| 3 | Honor | Android | 25.50 | yes | yes | 13.0 | 8.0 | 64.0 | 6.0 | 7250.0 | 480.0 | 2020 | 345 | 278.93 | 169.93 |
| 4 | Honor | Android | 15.32 | yes | no | 13.0 | 8.0 | 64.0 | 3.0 | 5000.0 | 185.0 | 2020 | 293 | 140.87 | 80.64 |
# checking the last 5 rows
df.tail(5)
# checking 10 random samples
df.sample(10, random_state=5)
| brand_name | os | screen_size | 4g | 5g | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | new_price | used_price | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1825 | Meizu | Android | 12.88 | yes | no | 21.0 | 5.0 | 32.0 | 4.0 | 3050.0 | 168.0 | 2015 | 752 | 388.89 | 127.30 |
| 698 | Asus | Android | 15.90 | yes | no | NaN | 8.0 | 64.0 | 4.0 | 4000.0 | 165.0 | 2019 | 397 | 310.53 | 86.73 |
| 2997 | Xiaomi | Android | 12.83 | no | no | 13.0 | 5.0 | 32.0 | 4.0 | 3100.0 | 199.0 | 2014 | 1017 | 121.05 | 73.04 |
| 667 | Apple | iOS | 10.34 | yes | no | 8.0 | 1.2 | 16.0 | 4.0 | 1810.0 | 129.0 | 2014 | 877 | 361.50 | 50.94 |
| 697 | Asus | Android | 15.90 | yes | no | NaN | 8.0 | 32.0 | 4.0 | 4000.0 | 165.0 | 2019 | 524 | 299.27 | 113.06 |
| 819 | BlackBerry | Android | 15.21 | yes | no | NaN | 16.0 | 64.0 | 4.0 | 4000.0 | 170.0 | 2018 | 629 | 348.29 | 109.28 |
| 3365 | Motorola | Android | 15.34 | yes | no | NaN | 8.0 | 32.0 | 3.0 | 4000.0 | 189.4 | 2020 | 101 | 169.99 | 97.42 |
| 1896 | Micromax | Android | 10.34 | no | no | 5.0 | 2.0 | 16.0 | 4.0 | 2000.0 | 158.0 | 2014 | 797 | 90.06 | 47.22 |
| 3381 | Motorola | Android | 15.34 | yes | no | 48.0 | 25.0 | 128.0 | 4.0 | 3600.0 | 165.0 | 2019 | 422 | 254.99 | 206.65 |
| 1757 | LG | Android | 12.57 | yes | no | 8.0 | 1.3 | 16.0 | 4.0 | 2300.0 | 130.0 | 2013 | 633 | 261.47 | 69.54 |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3454 entries, 0 to 3453
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   brand_name        3454 non-null   object 
 1   os                3454 non-null   object 
 2   screen_size       3454 non-null   float64
 3   4g                3454 non-null   object 
 4   5g                3454 non-null   object 
 5   main_camera_mp    3275 non-null   float64
 6   selfie_camera_mp  3452 non-null   float64
 7   int_memory        3450 non-null   float64
 8   ram               3450 non-null   float64
 9   battery           3448 non-null   float64
 10  weight            3447 non-null   float64
 11  release_year      3454 non-null   int64  
 12  days_used         3454 non-null   int64  
 13  new_price         3454 non-null   float64
 14  used_price        3454 non-null   float64
dtypes: float64(9), int64(2), object(4)
memory usage: 404.9+ KB
# Checking null values in the data
df.isnull().sum()
brand_name            0
os                    0
screen_size           0
4g                    0
5g                    0
main_camera_mp      179
selfie_camera_mp      2
int_memory            4
ram                   4
battery               6
weight                7
release_year          0
days_used             0
new_price             0
used_price            0
dtype: int64
# checking duplicate values in the data
print(f"There are {df.duplicated().sum()} duplicate values")
There are 0 duplicate values
# checking the columns in the data
df.columns
Index(['brand_name', 'os', 'screen_size', '4g', '5g', 'main_camera_mp',
'selfie_camera_mp', 'int_memory', 'ram', 'battery', 'weight',
'release_year', 'days_used', 'new_price', 'used_price'],
dtype='object')
# checking the unique values in the respective columns
df.nunique()
brand_name            34
os                     4
screen_size          142
4g                     2
5g                     2
main_camera_mp        41
selfie_camera_mp      37
int_memory            15
ram                   12
battery              324
weight               555
release_year           8
days_used            924
new_price           2988
used_price          3094
dtype: int64
# statistical summary of the data
df.describe(include="all").T
| count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| brand_name | 3454 | 34 | Others | 502 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| os | 3454 | 4 | Android | 3214 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| screen_size | 3454.0 | NaN | NaN | NaN | 13.713115 | 3.80528 | 5.08 | 12.7 | 12.83 | 15.34 | 30.71 |
| 4g | 3454 | 2 | yes | 2335 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5g | 3454 | 2 | no | 3302 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| main_camera_mp | 3275.0 | NaN | NaN | NaN | 9.460208 | 4.815461 | 0.08 | 5.0 | 8.0 | 13.0 | 48.0 |
| selfie_camera_mp | 3452.0 | NaN | NaN | NaN | 6.554229 | 6.970372 | 0.0 | 2.0 | 5.0 | 8.0 | 32.0 |
| int_memory | 3450.0 | NaN | NaN | NaN | 54.573099 | 84.972371 | 0.01 | 16.0 | 32.0 | 64.0 | 1024.0 |
| ram | 3450.0 | NaN | NaN | NaN | 4.036122 | 1.365105 | 0.02 | 4.0 | 4.0 | 4.0 | 12.0 |
| battery | 3448.0 | NaN | NaN | NaN | 3133.402697 | 1299.682844 | 500.0 | 2100.0 | 3000.0 | 4000.0 | 9720.0 |
| weight | 3447.0 | NaN | NaN | NaN | 182.751871 | 88.413228 | 69.0 | 142.0 | 160.0 | 185.0 | 855.0 |
| release_year | 3454.0 | NaN | NaN | NaN | 2015.965258 | 2.298455 | 2013.0 | 2014.0 | 2015.5 | 2018.0 | 2020.0 |
| days_used | 3454.0 | NaN | NaN | NaN | 674.869716 | 248.580166 | 91.0 | 533.5 | 690.5 | 868.75 | 1094.0 |
| new_price | 3454.0 | NaN | NaN | NaN | 237.038848 | 194.302782 | 18.2 | 120.3425 | 189.785 | 291.115 | 2560.2 |
| used_price | 3454.0 | NaN | NaN | NaN | 92.302936 | 54.701648 | 4.65 | 56.4825 | 81.87 | 116.245 | 749.52 |
# function to plot a boxplot and a histogram along the same scale.
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined
    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12, 7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # number of rows of the subplot grid = 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot; a star indicates the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins
    )  # histogram (bins=None lets seaborn choose the number of bins)
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # add median to the histogram
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top
    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of counts (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """
    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))
    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )
    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category
        x = p.get_x() + p.get_width() / 2  # x-position of the bar's center
        y = p.get_height()  # height of the bar
        ax.annotate(
            label, (x, y), ha="center", va="center", size=12,
            xytext=(0, 5), textcoords="offset points",
        )  # annotate the bar with its label
    plt.show()  # show the plot
histogram_boxplot( df, 'screen_size')
1. There is a lot of skewness in the column.
histogram_boxplot( df, 'main_camera_mp')
1. There is a lot of skewness in the column.
histogram_boxplot( df, 'selfie_camera_mp')
1. There is a lot of skewness in the column.
histogram_boxplot( df, 'int_memory')
1. There is a lot of skewness in the column.
histogram_boxplot( df, 'ram')
1. The mean and median are almost the same.
histogram_boxplot( df, 'battery')
histogram_boxplot( df, 'weight')
1. There is a lot of skewness in the column.
histogram_boxplot( df, 'days_used')
1. The column is skewed to the left.
histogram_boxplot( df, 'new_price')
1. There is a lot of skewness in the column.
histogram_boxplot( df, 'used_price')
1. There is a lot of skewness in the column.
labeled_barplot ( df, 'brand_name' , perc=True)
labeled_barplot ( df , 'os' , perc=True)
labeled_barplot ( df , '4g' , perc=True)
labeled_barplot ( df , '5g' , perc=True)
histogram_boxplot( df, 'used_price')
plt.title('Distribution of used device price')
plt.show()
labeled_barplot ( df , 'os' , perc=True)
# tabular representation of data
df.groupby('brand_name')['ram'].mean().sort_values(ascending=False).reset_index()
| brand_name | ram | |
|---|---|---|
| 0 | OnePlus | 6.363636 |
| 1 | Oppo | 4.961240 |
| 2 | Vivo | 4.756410 |
| 3 | Huawei | 4.655378 |
| 4 | Honor | 4.603448 |
| 5 | Xiaomi | 4.583333 |
| 6 | 4.533333 | |
| 7 | Meizu | 4.451613 |
| 8 | Samsung | 4.199413 |
| 9 | Realme | 4.195122 |
| 10 | Sony | 4.069767 |
| 11 | Asus | 4.049180 |
| 12 | ZTE | 4.023214 |
| 13 | HTC | 4.000000 |
| 14 | Apple | 4.000000 |
| 15 | XOLO | 4.000000 |
| 16 | Microsoft | 4.000000 |
| 17 | Panasonic | 4.000000 |
| 18 | Coolpad | 3.954545 |
| 19 | Motorola | 3.943396 |
| 20 | LG | 3.936567 |
| 21 | Gionee | 3.933036 |
| 22 | Acer | 3.901961 |
| 23 | Lenovo | 3.885965 |
| 24 | BlackBerry | 3.829545 |
| 25 | Others | 3.777888 |
| 26 | Spice | 3.750000 |
| 27 | Micromax | 3.679487 |
| 28 | Alcatel | 3.407025 |
| 29 | Karbonn | 3.353448 |
| 30 | Lava | 3.277778 |
| 31 | Infinix | 2.600000 |
| 32 | Nokia | 2.420294 |
| 33 | Celkon | 1.613636 |
# plotting the data
df.groupby(['brand_name'])['ram'].mean().reset_index().sort_values(['ram']).plot(x='brand_name',y='ram',kind='bar',figsize=(15,5))
plt.show()
# Tabular representation of data
large_battery = df[df['battery'] > 4500].copy()
large_battery.groupby('brand_name')['weight'].mean().sort_values(ascending=False).reset_index()
| brand_name | weight | |
|---|---|---|
| 0 | 517.000000 | |
| 1 | Lenovo | 442.721429 |
| 2 | Apple | 439.558824 |
| 3 | Sony | 439.500000 |
| 4 | HTC | 425.000000 |
| 5 | Samsung | 398.352000 |
| 6 | Huawei | 394.486486 |
| 7 | Others | 390.546341 |
| 8 | Alcatel | 380.000000 |
| 9 | LG | 366.058333 |
| 10 | Acer | 360.000000 |
| 11 | Nokia | 318.000000 |
| 12 | Asus | 313.772727 |
| 13 | Honor | 248.714286 |
| 14 | Xiaomi | 231.500000 |
| 15 | Gionee | 209.430000 |
| 16 | Motorola | 200.757143 |
| 17 | Realme | 196.833333 |
| 18 | Vivo | 195.630769 |
| 19 | ZTE | 195.400000 |
| 20 | Oppo | 195.000000 |
| 21 | Infinix | 193.000000 |
| 22 | Panasonic | 182.000000 |
| 23 | Spice | 158.000000 |
| 24 | Micromax | 118.000000 |
# plotting the data
large_battery.groupby('brand_name')['weight'].mean().sort_values(ascending=False).reset_index().plot(x='brand_name',y='weight',kind='bar',figsize=(15,5))
plt.show()
# creating a new dataframe by filtering the screen size column for values above 6 inches
# As screen_size is in cm, converting 6 inches to cm: 6 * 2.54 = 15.24
big_screen = df[df['screen_size']> 15.24]
# calling the new data frame to see the brand names
big_screen['brand_name'].value_counts().reset_index()
| index | brand_name | |
|---|---|---|
| 0 | Huawei | 149 |
| 1 | Samsung | 119 |
| 2 | Others | 99 |
| 3 | Vivo | 80 |
| 4 | Honor | 72 |
| 5 | Oppo | 70 |
| 6 | Xiaomi | 69 |
| 7 | Lenovo | 69 |
| 8 | LG | 59 |
| 9 | Motorola | 42 |
| 10 | Asus | 41 |
| 11 | Realme | 40 |
| 12 | Alcatel | 26 |
| 13 | Apple | 24 |
| 14 | Acer | 19 |
| 15 | ZTE | 17 |
| 16 | Meizu | 17 |
| 17 | OnePlus | 16 |
| 18 | Nokia | 15 |
| 19 | Sony | 12 |
| 20 | Infinix | 10 |
| 21 | HTC | 7 |
| 22 | Micromax | 7 |
| 23 | 4 | |
| 24 | Gionee | 3 |
| 25 | XOLO | 3 |
| 26 | Coolpad | 3 |
| 27 | Karbonn | 2 |
| 28 | Spice | 2 |
| 29 | Panasonic | 2 |
| 30 | Microsoft | 1 |
plt.figure(figsize=(15,5))
sns.countplot(x='brand_name',data= big_screen,color='red',order=big_screen['brand_name'].value_counts().sort_values(ascending= True).index)
plt.xticks(rotation=90)
plt.show()
labeled_barplot ( big_screen , 'brand_name' , perc=True)
# creating a new dataframe by filtering the 'selfie_camera_mp' column for values above 8MP
eight_mp = df[df['selfie_camera_mp']>8].copy()
# calling the new data frame to see the brand names
eight_mp['brand_name'].value_counts()
Huawei        87
Vivo          78
Oppo          75
Xiaomi        63
Samsung       57
Honor         41
Others        34
LG            32
Motorola      26
Meizu         24
HTC           20
ZTE           20
OnePlus       18
Realme        18
Sony          14
Lenovo        14
Nokia         10
Asus           6
Infinix        4
Gionee         4
Coolpad        3
Micromax       2
BlackBerry     2
Panasonic      2
Acer           1
Name: brand_name, dtype: int64
# plotting to understand the distribution of the above filtered data
histogram_boxplot( eight_mp, 'selfie_camera_mp')
plt.title('Distribution of selfie camera resolution for devices with greater than 8MP selfie cameras')
plt.show()
df.corr()
| screen_size | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | new_price | used_price | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| screen_size | 1.000000 | 0.150316 | 0.271640 | 0.071291 | 0.274449 | 0.813533 | 0.828890 | 0.364223 | -0.291723 | 0.340895 | 0.529275 |
| main_camera_mp | 0.150316 | 1.000000 | 0.429264 | 0.018766 | 0.260802 | 0.248563 | -0.087738 | 0.353728 | -0.144672 | 0.358298 | 0.459209 |
| selfie_camera_mp | 0.271640 | 0.429264 | 1.000000 | 0.296426 | 0.477411 | 0.369709 | -0.004997 | 0.690942 | -0.552636 | 0.415596 | 0.614675 |
| int_memory | 0.071291 | 0.018766 | 0.296426 | 1.000000 | 0.122496 | 0.117736 | 0.014948 | 0.235429 | -0.242712 | 0.369145 | 0.378347 |
| ram | 0.274449 | 0.260802 | 0.477411 | 0.122496 | 1.000000 | 0.280740 | 0.089916 | 0.314203 | -0.280066 | 0.494293 | 0.529434 |
| battery | 0.813533 | 0.248563 | 0.369709 | 0.117736 | 0.280740 | 1.000000 | 0.703388 | 0.488660 | -0.370895 | 0.370490 | 0.549647 |
| weight | 0.828890 | -0.087738 | -0.004997 | 0.014948 | 0.089916 | 0.703388 | 1.000000 | 0.071290 | -0.067470 | 0.219115 | 0.357983 |
| release_year | 0.364223 | 0.353728 | 0.690942 | 0.235429 | 0.314203 | 0.488660 | 0.071290 | 1.000000 | -0.750390 | 0.303571 | 0.494910 |
| days_used | -0.291723 | -0.144672 | -0.552636 | -0.242712 | -0.280066 | -0.370895 | -0.067470 | -0.750390 | 1.000000 | -0.246353 | -0.385777 |
| new_price | 0.340895 | 0.358298 | 0.415596 | 0.369145 | 0.494293 | 0.370490 | 0.219115 | 0.303571 | -0.246353 | 1.000000 | 0.809335 |
| used_price | 0.529275 | 0.459209 | 0.614675 | 0.378347 | 0.529434 | 0.549647 | 0.357983 | 0.494910 | -0.385777 | 0.809335 | 1.000000 |
plt.figure(figsize=(15,5))
sns.heatmap(df.corr(), annot=True, linewidths=.5, fmt='.1f', center=0)  # heatmap of correlations, diverging around 0
plt.show()
# creating a copy of the original data frame
missing_df = df.copy()
# checking for null values
missing_df.isnull().sum()
brand_name            0
os                    0
screen_size           0
4g                    0
5g                    0
main_camera_mp      179
selfie_camera_mp      2
int_memory            4
ram                   4
battery               6
weight                7
release_year          0
days_used             0
new_price             0
used_price            0
dtype: int64
# filling missing values with the respective column medians
impute_missing = [
"main_camera_mp",
"selfie_camera_mp",
"int_memory",
"ram",
"battery",
"weight",
]
missing_df[impute_missing] = missing_df[impute_missing].apply(lambda x: x.fillna(x.median()), axis=0)
missing_df.isnull().sum()
brand_name          0
os                  0
screen_size         0
4g                  0
5g                  0
main_camera_mp      0
selfie_camera_mp    0
int_memory          0
ram                 0
battery             0
weight              0
release_year        0
days_used           0
new_price           0
used_price          0
dtype: int64
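The imputation step above can be illustrated on a tiny frame with hypothetical values: each column's NaNs are replaced by that column's own median, independently of the other columns.

```python
import numpy as np
import pandas as pd

# Toy illustration of per-column median imputation (hypothetical values).
toy = pd.DataFrame(
    {"ram": [2.0, 4.0, np.nan, 8.0], "battery": [3000.0, np.nan, 4000.0, 5000.0]}
)
toy[["ram", "battery"]] = toy[["ram", "battery"]].apply(
    lambda x: x.fillna(x.median()), axis=0
)
print(toy["ram"].tolist())      # NaN replaced by median of [2, 4, 8] -> 4.0
print(toy["battery"].tolist())  # NaN replaced by median of [3000, 4000, 5000] -> 4000.0
```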
plt.figure(figsize=(15,5))
sns.heatmap(missing_df.corr(), annot=True, linewidths=.5, fmt='.1f', center=0)  # heatmap of correlations, diverging around 0
plt.show()
# Checking the distributions before log transformation
columns_plot = [
"main_camera_mp",
"selfie_camera_mp",
"int_memory",
"ram",
"battery",
"weight",
"screen_size",
"days_used",
"new_price",
"used_price"
]
plt.figure(figsize=(15, 30))
for i in range(len(columns_plot)):
    plt.subplot(9, 3, i + 1)
    plt.hist(missing_df[columns_plot[i]], bins=50)
    plt.tight_layout()
    plt.xlabel(columns_plot[i], fontsize=15)
plt.show()
# Plot of "new_price", "used_price" and "weight" before log transformation
cols_to_log = ["new_price", "used_price", "weight"]
for colname in cols_to_log:
    plt.hist(missing_df[colname], bins=50)
    plt.title(colname)
    plt.show()
# Log transformation
for colname in cols_to_log:
    missing_df[colname + "_log"] = np.log(missing_df[colname])
missing_df.drop(cols_to_log, axis=1, inplace=True)  # drop the untransformed columns
# Plot of "new_price", "used_price" and "weight" after log transformation
for colname in cols_to_log:
    plt.hist(missing_df[colname + "_log"], bins=50)
    plt.title(colname + "_log")
    plt.show()
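One practical point about the log transform above: it is invertible, so any prediction made on the log scale can be mapped back to the original dollar scale with np.exp. A minimal round-trip sketch, using two new_price values taken from the head of the data:

```python
import numpy as np

# Round-trip check: log-transform then exponentiate recovers the original prices.
price = np.array([111.62, 249.39])    # example new_price values from the data
log_price = np.log(price)             # forward transform, as done above
recovered = np.exp(log_price)         # inverse transform
print(np.allclose(recovered, price))  # True
```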
# creating a new data frame
newer_df = missing_df.copy()
# let's plot the boxplots of all numerical columns to check for outliers
plt.figure(figsize=(15, 35))
for i, variable in enumerate(newer_df.select_dtypes(include=np.number).columns.tolist()):
    plt.subplot(9, 3, i + 1)
    plt.boxplot(newer_df[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)
plt.show()
# Visual outlier check on column 'weight_log'
plt.hist(newer_df['weight_log'], 20)
plt.title('Histogram of weight_log')
plt.show()
sns.boxplot(newer_df['weight_log'])
plt.title('Boxplot of weight_log')
plt.show()
# Visual outlier check on column 'new_price_log'
plt.hist(newer_df['new_price_log'], 20)
plt.title('Histogram of new_price_log')
plt.show()
sns.boxplot(newer_df['new_price_log'])
plt.title('Boxplot of new_price_log')
plt.show()
# Visual outlier check on column 'used_price_log'
plt.hist(newer_df['used_price_log'], 20)
plt.title('Histogram of used_price_log')
plt.show()
sns.boxplot(newer_df['used_price_log'])
plt.title('Boxplot of used_price_log')
plt.show()
# functions to treat outliers by flooring and capping
def treat_outliers(df, col):
    """
    Treats outliers in a variable
    df: dataframe
    col: dataframe column
    """
    Q1 = df[col].quantile(0.25)  # 25th percentile
    Q3 = df[col].quantile(0.75)  # 75th percentile
    IQR = Q3 - Q1
    Lower_Whisker = Q1 - 1.5 * IQR
    Upper_Whisker = Q3 + 1.5 * IQR
    # values smaller than Lower_Whisker are floored at Lower_Whisker
    # values greater than Upper_Whisker are capped at Upper_Whisker
    df[col] = np.clip(df[col], Lower_Whisker, Upper_Whisker)
    return df


def treat_outliers_all(df, col_list):
    """
    Treats outliers in a list of variables
    df: dataframe
    col_list: list of dataframe columns
    """
    for c in col_list:
        df = treat_outliers(df, c)
    return df
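A minimal check of the flooring/capping logic on a toy column (hypothetical values; 100.0 is an obvious outlier):

```python
import numpy as np
import pandas as pd

# IQR-based capping on a toy column; mirrors the treat_outliers logic above.
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 100.0]})
Q1, Q3 = toy["x"].quantile(0.25), toy["x"].quantile(0.75)  # 2.0 and 4.0 here
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR  # whiskers: -1.0 and 7.0
toy["x"] = np.clip(toy["x"], lower, upper)
print(toy["x"].tolist())  # [1.0, 2.0, 3.0, 4.0, 7.0] -- the outlier is capped
```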
# creating a list of columns that need to be treated
treat_out_cols = ["weight_log", "new_price_log", "used_price_log"]
# Applying the function
fit_df = treat_outliers_all(newer_df, treat_out_cols)
plt.figure(figsize=(15, 35))
for i, variable in enumerate(treat_out_cols):
    plt.subplot(9, 3, i + 1)
    plt.boxplot(fit_df[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)
plt.show()
# Rechecking column 'weight_log' after outlier treatment
plt.hist(fit_df['weight_log'], 20)
plt.title('Histogram of weight_log')
plt.show()
sns.boxplot(fit_df['weight_log'])
plt.title('Boxplot of weight_log')
plt.show()
# Rechecking column 'new_price_log' after outlier treatment
plt.hist(fit_df['new_price_log'], 20)
plt.title('Histogram of new_price_log')
plt.show()
sns.boxplot(fit_df['new_price_log'])
plt.title('Boxplot of new_price_log')
plt.show()
# Rechecking column 'used_price_log' after outlier treatment
plt.hist(fit_df['used_price_log'], 20)
plt.title('Histogram of used_price_log')
plt.show()
sns.boxplot(fit_df['used_price_log'])
plt.title('Boxplot of used_price_log')
plt.show()
fit_df.head(5)
| brand_name | os | screen_size | 4g | 5g | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | release_year | days_used | new_price_log | used_price_log | weight_log | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Honor | Android | 14.50 | yes | no | 13.0 | 5.0 | 64.0 | 3.0 | 3020.0 | 2020 | 127 | 4.715100 | 4.307572 | 4.983607 |
| 1 | Honor | Android | 17.30 | yes | yes | 13.0 | 16.0 | 128.0 | 8.0 | 4300.0 | 2020 | 325 | 5.519018 | 5.162097 | 5.361292 |
| 2 | Honor | Android | 16.69 | yes | yes | 13.0 | 8.0 | 128.0 | 8.0 | 4200.0 | 2020 | 162 | 5.884631 | 5.111084 | 5.361292 |
| 3 | Honor | Android | 25.50 | yes | yes | 13.0 | 8.0 | 64.0 | 6.0 | 7250.0 | 2020 | 345 | 5.630961 | 5.135387 | 5.617149 |
| 4 | Honor | Android | 15.32 | yes | no | 13.0 | 8.0 | 64.0 | 3.0 | 5000.0 | 2020 | 293 | 4.947837 | 4.389995 | 5.220356 |
fit_df.shape
print(f"There are {fit_df.shape[0]} rows and {fit_df.shape[1]} columns.")
There are 3454 rows and 15 columns.
fit_df.isnull().sum()
brand_name          0
os                  0
screen_size         0
4g                  0
5g                  0
main_camera_mp      0
selfie_camera_mp    0
int_memory          0
ram                 0
battery             0
release_year        0
days_used           0
new_price_log       0
used_price_log      0
weight_log          0
dtype: int64
fit_df.duplicated().sum()
0
fit_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3454 entries, 0 to 3453
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   brand_name        3454 non-null   object 
 1   os                3454 non-null   object 
 2   screen_size       3454 non-null   float64
 3   4g                3454 non-null   object 
 4   5g                3454 non-null   object 
 5   main_camera_mp    3454 non-null   float64
 6   selfie_camera_mp  3454 non-null   float64
 7   int_memory        3454 non-null   float64
 8   ram               3454 non-null   float64
 9   battery           3454 non-null   float64
 10  release_year      3454 non-null   int64  
 11  days_used         3454 non-null   int64  
 12  new_price_log     3454 non-null   float64
 13  used_price_log    3454 non-null   float64
 14  weight_log        3454 non-null   float64
dtypes: float64(9), int64(2), object(4)
memory usage: 404.9+ KB
# converting the object dtypes to category
category_col = fit_df.select_dtypes(exclude=np.number).columns.tolist()
fit_df[category_col] = fit_df[category_col].astype("category")
fit_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3454 entries, 0 to 3453
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype   
---  ------            --------------  -----   
 0   brand_name        3454 non-null   category
 1   os                3454 non-null   category
 2   screen_size       3454 non-null   float64 
 3   4g                3454 non-null   category
 4   5g                3454 non-null   category
 5   main_camera_mp    3454 non-null   float64 
 6   selfie_camera_mp  3454 non-null   float64 
 7   int_memory        3454 non-null   float64 
 8   ram               3454 non-null   float64 
 9   battery           3454 non-null   float64 
 10  release_year      3454 non-null   int64   
 11  days_used         3454 non-null   int64   
 12  new_price_log     3454 non-null   float64 
 13  used_price_log    3454 non-null   float64 
 14  weight_log        3454 non-null   float64 
dtypes: category(4), float64(9), int64(2)
memory usage: 312.2 KB
histogram_boxplot( fit_df, 'weight_log')
histogram_boxplot( fit_df, 'new_price_log')
histogram_boxplot( fit_df, 'used_price_log')
# Plot to show the relation between the target variable used_price_log and brand_name
plt.figure(figsize=(15,5))
sns.scatterplot(x ='brand_name', y='used_price_log', data = fit_df)
plt.xticks(rotation=90)
plt.show()
plt.figure(figsize=(15,5))
sns.scatterplot(x ='os', y='used_price_log', data = fit_df)
plt.xticks(rotation=90)
plt.show()
plt.figure(figsize=(15,5))
sns.scatterplot(x ='4g', y='used_price_log', data = fit_df)
plt.xticks(rotation=90)
plt.show()
plt.figure(figsize=(15,5))
sns.scatterplot(x ='5g', y='used_price_log', data = fit_df)
plt.xticks(rotation=90)
plt.show()
plt.figure(figsize=(15,5))
sns.scatterplot(x ='main_camera_mp', y='used_price_log', data = fit_df)
plt.xticks(rotation=90)
plt.show()
# Plot to show the relation between used_price_log and selfie_camera_mp
plt.figure(figsize=(15,5))
sns.scatterplot(x ='selfie_camera_mp', y='used_price_log', data = fit_df)
plt.xticks(rotation=90)
plt.show()
plt.figure(figsize=(15,5))
sns.scatterplot(x ='int_memory', y='used_price_log', data = fit_df)
plt.xticks(rotation=90)
plt.show()
plt.figure(figsize=(15,5))
sns.scatterplot(x ='ram', y='used_price_log', data = fit_df)
plt.xticks(rotation=90)
plt.show()
# Plot to show the relation between used_price_log and weight_log
# lmplot creates its own figure, so height/aspect are set instead of plt.figure
sns.lmplot(x='weight_log', y='used_price_log', data=fit_df, height=5, aspect=3)
plt.xticks(rotation=90)
plt.show()
# Plot to show the relation between used_price_log and screen_size
plt.figure(figsize=(15,5))
sns.scatterplot(x ='screen_size', y='used_price_log', data = fit_df)
plt.xticks(rotation=90)
plt.show()
plt.figure(figsize=(15,5))
sns.scatterplot(x ='release_year', y='used_price_log', data = fit_df)
plt.xticks(rotation=90)
plt.show()
sns.lmplot(x='days_used', y='used_price_log', data=fit_df, height=5, aspect=3)
plt.xticks(rotation=90)
plt.show()
sns.lmplot(x='new_price_log', y='used_price_log', data=fit_df, height=5, aspect=3)
plt.xticks(rotation=90)
plt.show()
sns.pairplot(fit_df, diag_kind="kde")  # pairplot creates its own figure
plt.show()
plt.figure(figsize=(15,5))
sns.heatmap(fit_df.corr(), annot=True, linewidths=.5, fmt='.1f', center=0)  # heatmap of correlations, diverging around 0
plt.show()
# defining X and y variables
X = fit_df.drop(["used_price_log"], axis=1)
y = fit_df["used_price_log"]
# let's add the intercept to data
X = sm.add_constant(X, has_constant='add')
X.head()
| const | brand_name | os | screen_size | 4g | 5g | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | release_year | days_used | new_price_log | weight_log | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | Honor | Android | 14.50 | yes | no | 13.0 | 5.0 | 64.0 | 3.0 | 3020.0 | 2020 | 127 | 4.715100 | 4.983607 |
| 1 | 1.0 | Honor | Android | 17.30 | yes | yes | 13.0 | 16.0 | 128.0 | 8.0 | 4300.0 | 2020 | 325 | 5.519018 | 5.361292 |
| 2 | 1.0 | Honor | Android | 16.69 | yes | yes | 13.0 | 8.0 | 128.0 | 8.0 | 4200.0 | 2020 | 162 | 5.884631 | 5.361292 |
| 3 | 1.0 | Honor | Android | 25.50 | yes | yes | 13.0 | 8.0 | 64.0 | 6.0 | 7250.0 | 2020 | 345 | 5.630961 | 5.617149 |
| 4 | 1.0 | Honor | Android | 15.32 | yes | no | 13.0 | 8.0 | 64.0 | 3.0 | 5000.0 | 2020 | 293 | 4.947837 | 5.220356 |
# creating dummy variables
X = pd.get_dummies(
X,
columns=X.select_dtypes(include=["object", "category"]).columns.tolist(),
drop_first=True,
)
# to ensure all variables are of float type
X = X.astype(float)
X.head()
| const | screen_size | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | release_year | days_used | new_price_log | weight_log | brand_name_Alcatel | brand_name_Apple | brand_name_Asus | brand_name_BlackBerry | brand_name_Celkon | brand_name_Coolpad | brand_name_Gionee | brand_name_Google | brand_name_HTC | brand_name_Honor | brand_name_Huawei | brand_name_Infinix | brand_name_Karbonn | brand_name_LG | brand_name_Lava | brand_name_Lenovo | brand_name_Meizu | brand_name_Micromax | brand_name_Microsoft | brand_name_Motorola | brand_name_Nokia | brand_name_OnePlus | brand_name_Oppo | brand_name_Others | brand_name_Panasonic | brand_name_Realme | brand_name_Samsung | brand_name_Sony | brand_name_Spice | brand_name_Vivo | brand_name_XOLO | brand_name_Xiaomi | brand_name_ZTE | os_Others | os_Windows | os_iOS | 4g_yes | 5g_yes | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 14.50 | 13.0 | 5.0 | 64.0 | 3.0 | 3020.0 | 2020.0 | 127.0 | 4.715100 | 4.983607 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 1.0 | 17.30 | 13.0 | 16.0 | 128.0 | 8.0 | 4300.0 | 2020.0 | 325.0 | 5.519018 | 5.361292 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| 2 | 1.0 | 16.69 | 13.0 | 8.0 | 128.0 | 8.0 | 4200.0 | 2020.0 | 162.0 | 5.884631 | 5.361292 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| 3 | 1.0 | 25.50 | 13.0 | 8.0 | 64.0 | 6.0 | 7250.0 | 2020.0 | 345.0 | 5.630961 | 5.617149 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| 4 | 1.0 | 15.32 | 13.0 | 8.0 | 64.0 | 3.0 | 5000.0 | 2020.0 | 293.0 | 4.947837 | 5.220356 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
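To see what `drop_first=True` does, here is a minimal sketch on a hypothetical toy frame (not the ReCell data): the first category in sorted order becomes the baseline and gets no dummy column, which avoids the "dummy variable trap" of perfectly collinear indicators.

```python
import pandas as pd

# hypothetical toy frame to illustrate the encoding used above
toy = pd.DataFrame(
    {"brand": ["Apple", "Honor", "Apple", "Xiaomi"], "ram": [4, 3, 6, 8]}
)

# "Apple" (first category alphabetically) becomes the baseline and gets
# no column; the remaining categories become 0/1 indicator columns
encoded = pd.get_dummies(toy, columns=["brand"], drop_first=True).astype(float)
print(encoded.columns.tolist())  # ['ram', 'brand_Honor', 'brand_Xiaomi']
```

A row's brand is then encoded by which (if any) of the remaining indicators is 1; a row with all zeros is the baseline brand.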
# splitting the data in a 70:30 train-test ratio
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
print("Number of rows in train data =", x_train.shape[0])
print("Number of rows in test data =", x_test.shape[0])
Number of rows in train data = 2417
Number of rows in test data = 1037
# fitting a linear model using statsmodels OLS
import statsmodels.api as sm

olsmodel = sm.OLS(y_train, x_train).fit()
print(olsmodel.summary())
OLS Regression Results
==============================================================================
Dep. Variable: used_price_log R-squared: 0.846
Model: OLS Adj. R-squared: 0.843
Method: Least Squares F-statistic: 271.1
Date: Wed, 02 Mar 2022 Prob (F-statistic): 0.00
Time: 18:59:03 Log-Likelihood: 238.67
No. Observations: 2417 AIC: -379.3
Df Residuals: 2368 BIC: -95.62
Df Model: 48
Covariance Type: nonrobust
=========================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------
const -31.0790 8.618 -3.606 0.000 -47.978 -14.179
screen_size 0.0289 0.003 10.997 0.000 0.024 0.034
main_camera_mp 0.0200 0.001 14.390 0.000 0.017 0.023
selfie_camera_mp 0.0140 0.001 13.045 0.000 0.012 0.016
int_memory 0.0001 6.61e-05 1.955 0.051 -4.01e-07 0.000
ram 0.0184 0.005 3.763 0.000 0.009 0.028
battery -6.28e-06 6.75e-06 -0.930 0.352 -1.95e-05 6.96e-06
release_year 0.0154 0.004 3.601 0.000 0.007 0.024
days_used 3.88e-05 2.94e-05 1.319 0.187 -1.89e-05 9.65e-05
new_price_log 0.4013 0.012 33.290 0.000 0.378 0.425
weight_log 0.3061 0.034 8.886 0.000 0.239 0.374
brand_name_Alcatel 0.0078 0.045 0.173 0.863 -0.081 0.097
brand_name_Apple 0.0033 0.141 0.024 0.981 -0.272 0.279
brand_name_Asus 0.0184 0.046 0.402 0.688 -0.071 0.108
brand_name_BlackBerry -0.0972 0.067 -1.451 0.147 -0.229 0.034
brand_name_Celkon 0.0027 0.063 0.042 0.966 -0.121 0.127
brand_name_Coolpad 0.0450 0.069 0.647 0.517 -0.091 0.181
brand_name_Gionee 0.0414 0.055 0.752 0.452 -0.067 0.149
brand_name_Google 0.0071 0.081 0.088 0.930 -0.151 0.165
brand_name_HTC 0.0033 0.046 0.072 0.942 -0.087 0.093
brand_name_Honor 0.0215 0.047 0.459 0.646 -0.070 0.114
brand_name_Huawei -0.0022 0.042 -0.051 0.960 -0.085 0.081
brand_name_Infinix 0.1109 0.089 1.248 0.212 -0.063 0.285
brand_name_Karbonn 0.0997 0.064 1.558 0.119 -0.026 0.225
brand_name_LG -0.0041 0.043 -0.095 0.924 -0.089 0.081
brand_name_Lava 0.0318 0.059 0.535 0.593 -0.085 0.148
brand_name_Lenovo 0.0416 0.043 0.964 0.335 -0.043 0.126
brand_name_Meizu 0.0185 0.053 0.347 0.729 -0.086 0.123
brand_name_Micromax -0.0013 0.046 -0.030 0.976 -0.091 0.088
brand_name_Microsoft 0.0919 0.084 1.092 0.275 -0.073 0.257
brand_name_Motorola 0.0039 0.047 0.082 0.934 -0.089 0.096
brand_name_Nokia 0.0553 0.049 1.122 0.262 -0.041 0.152
brand_name_OnePlus 0.1379 0.074 1.868 0.062 -0.007 0.283
brand_name_Oppo 0.0261 0.046 0.573 0.567 -0.063 0.115
brand_name_Others -0.0019 0.040 -0.048 0.962 -0.081 0.077
brand_name_Panasonic 0.0612 0.053 1.149 0.251 -0.043 0.166
brand_name_Realme 0.1057 0.059 1.806 0.071 -0.009 0.220
brand_name_Samsung -0.0173 0.041 -0.420 0.674 -0.098 0.063
brand_name_Sony -0.0387 0.048 -0.806 0.421 -0.133 0.055
brand_name_Spice -0.0126 0.060 -0.209 0.835 -0.131 0.106
brand_name_Vivo -0.0065 0.046 -0.141 0.888 -0.097 0.084
brand_name_XOLO 0.0169 0.052 0.323 0.747 -0.086 0.119
brand_name_Xiaomi 0.0817 0.046 1.781 0.075 -0.008 0.172
brand_name_ZTE 0.0018 0.045 0.041 0.967 -0.087 0.090
os_Others 0.0914 0.030 3.031 0.002 0.032 0.151
os_Windows 0.0022 0.043 0.052 0.959 -0.082 0.087
os_iOS -0.0256 0.140 -0.183 0.855 -0.300 0.249
4g_yes 0.0490 0.015 3.245 0.001 0.019 0.079
5g_yes -0.0217 0.030 -0.715 0.474 -0.081 0.038
==============================================================================
Omnibus: 131.508 Durbin-Watson: 1.908
Prob(Omnibus): 0.000 Jarque-Bera (JB): 185.699
Skew: -0.488 Prob(JB): 4.74e-41
Kurtosis: 3.945 Cond. No. 7.55e+06
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.55e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

# function to compute adjusted R-squared
def adj_r2_score(predictors, targets, predictions):
r2 = r2_score(targets, predictions)
n = predictors.shape[0]
k = predictors.shape[1]
return 1 - ((1 - r2) * (n - 1) / (n - k - 1))
# function to compute MAPE
def mape_score(targets, predictions):
return np.mean(np.abs(targets - predictions) / targets) * 100
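As a quick sanity check of the two formulas above, one can verify them by hand on hypothetical toy values (not model output):

```python
import numpy as np
from sklearn.metrics import r2_score

# hypothetical toy targets and predictions
targets = np.array([100.0, 200.0, 400.0])
preds = np.array([110.0, 190.0, 400.0])

# MAPE by hand: mean(10/100, 10/200, 0/400) * 100 = mean(0.10, 0.05, 0.00) * 100 = 5.0
mape = np.mean(np.abs(targets - preds) / targets) * 100
print(round(mape, 2))  # 5.0

# adjusted R-squared with n = 3 observations and k = 1 predictor;
# it penalizes R-squared for the number of predictors, so it is always lower
r2 = r2_score(targets, preds)
adj_r2 = 1 - ((1 - r2) * (3 - 1) / (3 - 1 - 1))
```

The penalty term grows with k, which is why adjusted R-squared is the fairer score when comparing models with different numbers of predictors.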
# function to compute different metrics to check performance of a regression model
def model_performance_regression(model, predictors, target):
"""
Function to compute different metrics to check regression model performance
model: regressor
predictors: independent variables
target: dependent variable
"""
# predicting using the independent variables
pred = model.predict(predictors)
r2 = r2_score(target, pred) # to compute R-squared
adjr2 = adj_r2_score(predictors, target, pred) # to compute adjusted R-squared
rmse = np.sqrt(mean_squared_error(target, pred)) # to compute RMSE
mae = mean_absolute_error(target, pred) # to compute MAE
mape = mape_score(target, pred) # to compute MAPE
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{
"RMSE": rmse,
"MAE": mae,
"R-squared": r2,
"Adj. R-squared": adjr2,
"MAPE": mape,
},
index=[0],
)
return df_perf
print("Training Performance\n")
olsmodel_train_perf = model_performance_regression(olsmodel, x_train, y_train)
olsmodel_train_perf
Training Performance
| RMSE | MAE | R-squared | Adj. R-squared | MAPE | |
|---|---|---|---|---|---|
| 0 | 0.219219 | 0.17269 | 0.846065 | 0.842878 | 4.054856 |
print("Test Performance\n")
olsmodel_test_perf = model_performance_regression(olsmodel, x_test, y_test)
olsmodel_test_perf
Test Performance
| RMSE | MAE | R-squared | Adj. R-squared | MAPE | |
|---|---|---|---|---|---|
| 0 | 0.225278 | 0.176743 | 0.845064 | 0.837372 | 4.158587 |
Observations
The train and test $R^2$ are 0.846 and 0.845 respectively, indicating that the model explains ~85% of the total variation in both sets. The two scores are comparable.
RMSE values on the train and test sets are also comparable.
This shows that the model is not overfitting.
MAE indicates that our current model is able to predict used_price_log within a mean error of ~0.18 on the test set.
A MAPE of ~4.16 on the test data means that we are able to predict within ~5% of used_price_log.
General rule of thumb: a VIF above 5 indicates high multicollinearity, so we check the VIF of each predictor next.
from statsmodels.stats.outliers_influence import variance_inflation_factor
# we will define a function to check VIF
def checking_vif(predictors):
vif = pd.DataFrame()
vif["feature"] = predictors.columns
# calculating VIF for each feature
vif["VIF"] = [
variance_inflation_factor(predictors.values, i)
for i in range(len(predictors.columns))
]
return vif
checking_vif(x_train)
| feature | VIF | |
|---|---|---|
| 0 | const | 3.659603e+06 |
| 1 | screen_size | 5.019206e+00 |
| 2 | main_camera_mp | 2.095425e+00 |
| 3 | selfie_camera_mp | 2.801130e+00 |
| 4 | int_memory | 1.347808e+00 |
| 5 | ram | 2.255870e+00 |
| 6 | battery | 3.863617e+00 |
| 7 | release_year | 4.724703e+00 |
| 8 | days_used | 2.657136e+00 |
| 9 | new_price_log | 3.138088e+00 |
| 10 | weight_log | 3.607424e+00 |
| 11 | brand_name_Alcatel | 3.406049e+00 |
| 12 | brand_name_Apple | 1.312846e+01 |
| 13 | brand_name_Asus | 3.333711e+00 |
| 14 | brand_name_BlackBerry | 1.634120e+00 |
| 15 | brand_name_Celkon | 1.773670e+00 |
| 16 | brand_name_Coolpad | 1.466615e+00 |
| 17 | brand_name_Gionee | 1.951288e+00 |
| 18 | brand_name_Google | 1.323411e+00 |
| 19 | brand_name_HTC | 3.408670e+00 |
| 20 | brand_name_Honor | 3.344491e+00 |
| 21 | brand_name_Huawei | 5.986196e+00 |
| 22 | brand_name_Infinix | 1.282467e+00 |
| 23 | brand_name_Karbonn | 1.573752e+00 |
| 24 | brand_name_LG | 4.849253e+00 |
| 25 | brand_name_Lava | 1.711512e+00 |
| 26 | brand_name_Lenovo | 4.559872e+00 |
| 27 | brand_name_Meizu | 2.172896e+00 |
| 28 | brand_name_Micromax | 3.360505e+00 |
| 29 | brand_name_Microsoft | 1.868250e+00 |
| 30 | brand_name_Motorola | 3.258696e+00 |
| 31 | brand_name_Nokia | 3.464900e+00 |
| 32 | brand_name_OnePlus | 1.437220e+00 |
| 33 | brand_name_Oppo | 3.972862e+00 |
| 34 | brand_name_Others | 9.701182e+00 |
| 35 | brand_name_Panasonic | 2.105939e+00 |
| 36 | brand_name_Realme | 1.931075e+00 |
| 37 | brand_name_Samsung | 7.538927e+00 |
| 38 | brand_name_Sony | 2.930318e+00 |
| 39 | brand_name_Spice | 1.689328e+00 |
| 40 | brand_name_Vivo | 3.647282e+00 |
| 41 | brand_name_XOLO | 2.138167e+00 |
| 42 | brand_name_Xiaomi | 3.714093e+00 |
| 43 | brand_name_ZTE | 3.794194e+00 |
| 44 | os_Others | 1.727926e+00 |
| 45 | os_Windows | 1.595801e+00 |
| 46 | os_iOS | 1.183540e+01 |
| 47 | 4g_yes | 2.464041e+00 |
| 48 | 5g_yes | 1.842298e+00 |
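To build intuition for what the VIF values above mean, here is a synthetic illustration (not the ReCell features): when one column is nearly a copy of another, both get a large VIF, while an independent column stays near 1.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# synthetic data: x2 is x1 plus small noise, x3 is independent
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
X_demo = pd.DataFrame(
    {
        "const": 1.0,
        "x1": x1,
        "x2": x1 + rng.normal(scale=0.1, size=500),
        "x3": rng.normal(size=500),
    }
)

# VIF of a column = 1 / (1 - R^2) from regressing it on all the others
vifs = {
    col: variance_inflation_factor(X_demo.values, i)
    for i, col in enumerate(X_demo.columns)
}
print({k: round(v, 2) for k, v in vifs.items()})
# x1 and x2 show VIF well above 5; x3 stays close to 1
```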
def treating_multicollinearity(predictors, target, high_vif_columns):
"""
Checking the effect of dropping the columns showing high multicollinearity
on model performance (adj. R-squared and RMSE)
predictors: independent variables
target: dependent variable
high_vif_columns: columns having high VIF
"""
# empty lists to store adj. R-squared and RMSE values
adj_r2 = []
rmse = []
# build ols models by dropping one of the high VIF columns at a time
# store the adjusted R-squared and RMSE in the lists defined previously
for cols in high_vif_columns:
# defining the new train set
train = predictors.loc[:, ~predictors.columns.str.startswith(cols)]
# create the model
olsmodel = sm.OLS(target, train).fit()
# adding adj. R-squared and RMSE to the lists
adj_r2.append(olsmodel.rsquared_adj)
rmse.append(np.sqrt(olsmodel.mse_resid))
# creating a dataframe for the results
temp = pd.DataFrame(
{
"col": high_vif_columns,
"Adj. R-squared after_dropping col": adj_r2,
"RMSE after dropping col": rmse,
}
).sort_values(by="Adj. R-squared after_dropping col", ascending=False)
temp.reset_index(drop=True, inplace=True)
return temp
# To remove multicollinearity, we drop columns with VIF greater than 5, one at a time.
# Drop the variable whose removal causes the least change in adjusted R-squared.
# Continue until all VIF scores are under 5.
# List of columns to be dropped
col_list = [
"screen_size",
"brand_name_Huawei",
"brand_name_Others",
"brand_name_Samsung",
]
res = treating_multicollinearity(x_train, y_train, col_list)
res
| col | Adj. R-squared after_dropping col | RMSE after dropping col | |
|---|---|---|---|
| 0 | brand_name_Others | 0.843011 | 0.221428 |
| 1 | brand_name_Huawei | 0.843011 | 0.221428 |
| 2 | brand_name_Samsung | 0.842999 | 0.221437 |
| 3 | screen_size | 0.834994 | 0.227012 |
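The drop-and-check loop above can also be automated. A minimal sketch on synthetic data (not the ReCell features), using the simpler criterion of dropping the non-constant column with the highest VIF each round (the notebook instead drops by smallest change in adjusted R-squared, which also accounts for predictive value):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def drop_high_vif(X, threshold=5.0):
    """Iteratively drop the non-constant column with the highest VIF
    until every remaining VIF is at or below the threshold."""
    X = X.copy()
    while True:
        cols = [c for c in X.columns if c != "const"]
        vifs = {
            c: variance_inflation_factor(X.values, X.columns.get_loc(c))
            for c in cols
        }
        worst, worst_vif = max(vifs.items(), key=lambda kv: kv[1])
        if worst_vif <= threshold:
            return X
        X = X.drop(columns=worst)

# synthetic demo: x2 nearly duplicates x1, so one of the pair gets dropped
rng = np.random.default_rng(1)
x1 = rng.normal(size=300)
X_demo = pd.DataFrame(
    {
        "const": 1.0,
        "x1": x1,
        "x2": x1 + rng.normal(scale=0.05, size=300),
        "x3": rng.normal(size=300),
    }
)
reduced = drop_high_vif(X_demo)
print(reduced.columns.tolist())
```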
# dropping the column screen_size
col_to_drop = "screen_size"
x_train2 = x_train.loc[:, ~x_train.columns.str.startswith(col_to_drop)]
x_test2 = x_test.loc[:, ~x_test.columns.str.startswith(col_to_drop)]
# Check VIF now
vif = checking_vif(x_train2)
print("VIF after dropping ", col_to_drop)
vif
VIF after dropping screen_size
| feature | VIF | |
|---|---|---|
| 0 | const | 3.647135e+06 |
| 1 | main_camera_mp | 2.088396e+00 |
| 2 | selfie_camera_mp | 2.800263e+00 |
| 3 | int_memory | 1.345798e+00 |
| 4 | ram | 2.253579e+00 |
| 5 | battery | 2.972752e+00 |
| 6 | release_year | 4.712704e+00 |
| 7 | days_used | 2.655663e+00 |
| 8 | new_price_log | 3.108293e+00 |
| 9 | weight_log | 2.567919e+00 |
| 10 | brand_name_Alcatel | 3.406047e+00 |
| 11 | brand_name_Apple | 1.297355e+01 |
| 12 | brand_name_Asus | 3.326370e+00 |
| 13 | brand_name_BlackBerry | 1.629603e+00 |
| 14 | brand_name_Celkon | 1.773075e+00 |
| 15 | brand_name_Coolpad | 1.466490e+00 |
| 16 | brand_name_Gionee | 1.933135e+00 |
| 17 | brand_name_Google | 1.322006e+00 |
| 18 | brand_name_HTC | 3.401478e+00 |
| 19 | brand_name_Honor | 3.344488e+00 |
| 20 | brand_name_Huawei | 5.982784e+00 |
| 21 | brand_name_Infinix | 1.278877e+00 |
| 22 | brand_name_Karbonn | 1.573227e+00 |
| 23 | brand_name_LG | 4.835518e+00 |
| 24 | brand_name_Lava | 1.710967e+00 |
| 25 | brand_name_Lenovo | 4.553393e+00 |
| 26 | brand_name_Meizu | 2.170450e+00 |
| 27 | brand_name_Micromax | 3.353692e+00 |
| 28 | brand_name_Microsoft | 1.867800e+00 |
| 29 | brand_name_Motorola | 3.247439e+00 |
| 30 | brand_name_Nokia | 3.460452e+00 |
| 31 | brand_name_OnePlus | 1.436900e+00 |
| 32 | brand_name_Oppo | 3.966562e+00 |
| 33 | brand_name_Others | 9.653104e+00 |
| 34 | brand_name_Panasonic | 2.105033e+00 |
| 35 | brand_name_Realme | 1.922178e+00 |
| 36 | brand_name_Samsung | 7.528652e+00 |
| 37 | brand_name_Sony | 2.926532e+00 |
| 38 | brand_name_Spice | 1.683118e+00 |
| 39 | brand_name_Vivo | 3.644537e+00 |
| 40 | brand_name_XOLO | 2.137217e+00 |
| 41 | brand_name_Xiaomi | 3.699824e+00 |
| 42 | brand_name_ZTE | 3.787506e+00 |
| 43 | os_Others | 1.627597e+00 |
| 44 | os_Windows | 1.595312e+00 |
| 45 | os_iOS | 1.167843e+01 |
| 46 | 4g_yes | 2.446153e+00 |
| 47 | 5g_yes | 1.839026e+00 |
# After dropping screen_size, check the effect of dropping the remaining high-VIF columns
# list of columns to be dropped
col_list = [
"brand_name_Huawei",
"brand_name_Others",
"brand_name_Samsung",
]
res = treating_multicollinearity(x_train2, y_train, col_list)
res
| col | Adj. R-squared after_dropping col | RMSE after dropping col | |
|---|---|---|---|
| 0 | brand_name_Huawei | 0.835057 | 0.226968 |
| 1 | brand_name_Others | 0.835019 | 0.226995 |
| 2 | brand_name_Samsung | 0.835019 | 0.226995 |
# dropping the column brand_name_Others
col_to_drop = "brand_name_Others"
x_train3 = x_train2.loc[:, ~x_train2.columns.str.startswith(col_to_drop)]
x_test3 = x_test2.loc[:, ~x_test2.columns.str.startswith(col_to_drop)]
# Check VIF now
vif = checking_vif(x_train3)
print("VIF after dropping ", col_to_drop)
vif
VIF after dropping brand_name_Others
| feature | VIF | |
|---|---|---|
| 0 | const | 3.647100e+06 |
| 1 | main_camera_mp | 2.086809e+00 |
| 2 | selfie_camera_mp | 2.800263e+00 |
| 3 | int_memory | 1.344974e+00 |
| 4 | ram | 2.252473e+00 |
| 5 | battery | 2.969930e+00 |
| 6 | release_year | 4.712458e+00 |
| 7 | days_used | 2.655481e+00 |
| 8 | new_price_log | 3.108105e+00 |
| 9 | weight_log | 2.567668e+00 |
| 10 | brand_name_Alcatel | 1.203574e+00 |
| 11 | brand_name_Apple | 1.207764e+01 |
| 12 | brand_name_Asus | 1.201901e+00 |
| 13 | brand_name_BlackBerry | 1.129657e+00 |
| 14 | brand_name_Celkon | 1.171172e+00 |
| 15 | brand_name_Coolpad | 1.052605e+00 |
| 16 | brand_name_Gionee | 1.083772e+00 |
| 17 | brand_name_Google | 1.048693e+00 |
| 18 | brand_name_HTC | 1.225399e+00 |
| 19 | brand_name_Honor | 1.272023e+00 |
| 20 | brand_name_Huawei | 1.497703e+00 |
| 21 | brand_name_Infinix | 1.057923e+00 |
| 22 | brand_name_Karbonn | 1.069523e+00 |
| 23 | brand_name_LG | 1.353842e+00 |
| 24 | brand_name_Lava | 1.069857e+00 |
| 25 | brand_name_Lenovo | 1.297060e+00 |
| 26 | brand_name_Meizu | 1.132562e+00 |
| 27 | brand_name_Micromax | 1.224268e+00 |
| 28 | brand_name_Microsoft | 1.494609e+00 |
| 29 | brand_name_Motorola | 1.247933e+00 |
| 30 | brand_name_Nokia | 1.499863e+00 |
| 31 | brand_name_OnePlus | 1.081470e+00 |
| 32 | brand_name_Oppo | 1.376532e+00 |
| 33 | brand_name_Panasonic | 1.106046e+00 |
| 34 | brand_name_Realme | 1.154539e+00 |
| 35 | brand_name_Samsung | 1.595225e+00 |
| 36 | brand_name_Sony | 1.204357e+00 |
| 37 | brand_name_Spice | 1.081151e+00 |
| 38 | brand_name_Vivo | 1.317638e+00 |
| 39 | brand_name_XOLO | 1.116135e+00 |
| 40 | brand_name_Xiaomi | 1.305557e+00 |
| 41 | brand_name_ZTE | 1.264491e+00 |
| 42 | os_Others | 1.626990e+00 |
| 43 | os_Windows | 1.593918e+00 |
| 44 | os_iOS | 1.167842e+01 |
| 45 | 4g_yes | 2.439852e+00 |
| 46 | 5g_yes | 1.838854e+00 |
Apart from the dummy variables brand_name_Apple and os_iOS, no predictor has a VIF above 5. These two are highly correlated with each other (Apple devices run iOS), and high VIF among dummy variables of this kind is expected and can generally be ignored. Hence, we can consider the assumption of no severe multicollinearity to be satisfied.
# initial list of columns
cols = x_train3.columns.tolist()
# setting an initial max p-value
max_p_value = 1
# Loop to check for p-values of the variables and drop the column with the highest p-value.
while len(cols) > 0:
# defining the train set
x_train_aux = x_train3[cols]
# fitting the model
model = sm.OLS(y_train, x_train_aux).fit()
# getting the p-values and the maximum p-value
p_values = model.pvalues
max_p_value = max(p_values)
# name of the variable with maximum p-value
feature_with_p_max = p_values.idxmax()
if max_p_value > 0.05:
cols.remove(feature_with_p_max)
else:
break
selected_features = cols # variables with p-values lesser than 0.05
print(selected_features)
['const', 'main_camera_mp', 'selfie_camera_mp', 'ram', 'battery', 'release_year', 'new_price_log', 'weight_log', 'brand_name_BlackBerry', 'brand_name_Karbonn', 'brand_name_OnePlus', 'brand_name_Xiaomi', '4g_yes']
# Use only the variables with p-values less than 0.05 to train model
x_train4 = x_train3[selected_features]
x_test4 = x_test3[selected_features]
olsmod2 = sm.OLS(y_train, x_train4).fit()
print(olsmod2.summary())
OLS Regression Results
==============================================================================
Dep. Variable: used_price_log R-squared: 0.836
Model: OLS Adj. R-squared: 0.835
Method: Least Squares F-statistic: 1020.
Date: Wed, 02 Mar 2022 Prob (F-statistic): 0.00
Time: 18:59:16 Log-Likelihood: 161.29
No. Observations: 2417 AIC: -296.6
Df Residuals: 2404 BIC: -221.3
Df Model: 12
Covariance Type: nonrobust
=========================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------
const -35.4270 6.765 -5.237 0.000 -48.693 -22.161
main_camera_mp 0.0190 0.001 14.868 0.000 0.017 0.022
selfie_camera_mp 0.0141 0.001 13.712 0.000 0.012 0.016
ram 0.0127 0.004 3.037 0.002 0.005 0.021
battery 3.168e-05 5.89e-06 5.375 0.000 2.01e-05 4.32e-05
release_year 0.0172 0.003 5.124 0.000 0.011 0.024
new_price_log 0.4084 0.011 37.835 0.000 0.387 0.430
weight_log 0.4995 0.029 17.387 0.000 0.443 0.556
brand_name_BlackBerry -0.1177 0.054 -2.183 0.029 -0.223 -0.012
brand_name_Karbonn 0.1258 0.053 2.381 0.017 0.022 0.229
brand_name_OnePlus 0.1375 0.064 2.148 0.032 0.012 0.263
brand_name_Xiaomi 0.0607 0.025 2.443 0.015 0.012 0.109
4g_yes 0.0337 0.015 2.314 0.021 0.005 0.062
==============================================================================
Omnibus: 122.315 Durbin-Watson: 1.916
Prob(Omnibus): 0.000 Jarque-Bera (JB): 170.281
Skew: -0.467 Prob(JB): 1.06e-37
Kurtosis: 3.904 Cond. No. 5.71e+06
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 5.71e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
# plot for fitted values vs residuals.
# Create a dataframe with actual, fitted and residual values
df_pred = pd.DataFrame()
df_pred["Actual Values"] = y_train # actual values
df_pred["Fitted Values"] = olsmod2.fittedvalues # predicted values
df_pred["Residuals"] = olsmod2.resid # residuals
df_pred.head()
| Actual Values | Fitted Values | Residuals | |
|---|---|---|---|
| 3026 | 4.087488 | 3.819438 | 0.268050 |
| 1525 | 4.448399 | 4.694680 | -0.246281 |
| 1128 | 4.315353 | 4.310987 | 0.004366 |
| 3003 | 4.282068 | 4.284963 | -0.002895 |
| 2907 | 4.456438 | 4.488880 | -0.032442 |
# let's plot the fitted values vs residuals
import seaborn as sns

sns.residplot(
    data=df_pred, x="Fitted Values", y="Residuals", color="purple", lowess=True
)
plt.xlabel("Fitted Values")
plt.ylabel("Residuals")
plt.title("Fitted vs Residual plot")
plt.show()
We see no pattern in the plot above. Hence, the assumptions of linearity and independence are satisfied.
Null hypothesis: Residuals are normally distributed
Alternate hypothesis: Residuals are not normally distributed
# histogram plot of the residual
sns.histplot(data=df_pred, x="Residuals", kde=True)
plt.title("Normality of residuals")
plt.show()
import scipy.stats as stats

# Q-Q plot of the residuals against a normal distribution
stats.probplot(df_pred["Residuals"], dist="norm", plot=plt)
plt.show()
# Shapiro-Wilk test for normality
stats.shapiro(df_pred["Residuals"])
ShapiroResult(statistic=0.9800310134887695, pvalue=6.362487569710371e-18)
Since the p-value is far below 0.05, we reject the null hypothesis: the residuals are not perfectly normally distributed. The histogram and Q-Q plot, however, show only a mild deviation from normality, which is acceptable for a sample of this size.
Homoscedasticity: If the variance of the residuals is symmetrically distributed across the regression line, the data is said to be homoscedastic.
Heteroscedasticity: If the variance of the residuals is unequal across the regression line, the data is said to be heteroscedastic.
Why the test?
The presence of non-constant variance in the error terms results in heteroscedasticity. Generally, non-constant variance arises in the presence of outliers.
We will test for homoscedasticity using the Goldfeld-Quandt test.
If we get a p-value greater than 0.05, we can say that the residuals are homoscedastic; otherwise, they are heteroscedastic.
# Goldfeld-Quandt test for homoscedasticity
import statsmodels.stats.api as sms
from statsmodels.compat import lzip
name = ["F statistic", "p-value"]
test = sms.het_goldfeldquandt(df_pred["Residuals"], x_train4)
lzip(name, test)
[('F statistic', 1.096951960398924), ('p-value', 0.05489357698368678)]
Since the p-value is greater than 0.05, we fail to reject the null hypothesis of homoscedasticity; the residuals can be considered to have constant variance.
sns.scatterplot(
data=df_pred, x="Fitted Values", y="Residuals", color="purple"
)
plt.xlabel("Fitted Values")
plt.ylabel("Residuals")
plt.title("Fitted vs Residual plot")
plt.show()
# Let us write the equation of linear regression
Equation = "Used Phone Price ="
print(Equation, end=" ")
for i in range(len(x_train4.columns)):
if i == 0:
print(np.round(olsmod2.params[i], 4), "+", end=" ")
elif i != len(x_train4.columns) - 1:
print(
"(",
np.round(olsmod2.params[i], 4),
")*(",
x_train4.columns[i],
")",
"+",
end=" ",
)
else:
print("(", np.round(olsmod2.params[i], 4), ")*(", x_train4.columns[i], ")")
Used Phone Price = -35.427 + ( 0.019 )*( main_camera_mp ) + ( 0.0141 )*( selfie_camera_mp ) + ( 0.0127 )*( ram ) + ( 0.0 )*( battery ) + ( 0.0172 )*( release_year ) + ( 0.4084 )*( new_price_log ) + ( 0.4995 )*( weight_log ) + ( -0.1177 )*( brand_name_BlackBerry ) + ( 0.1258 )*( brand_name_Karbonn ) + ( 0.1375 )*( brand_name_OnePlus ) + ( 0.0607 )*( brand_name_Xiaomi ) + ( 0.0337 )*( 4g_yes )
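Note that the equation above predicts used_price_log, not the price itself; to report an actual price, the prediction must be exponentiated back to the original scale. A minimal sketch with a hypothetical predicted log-price:

```python
import numpy as np

# hypothetical predicted log-price from the fitted equation above
pred_log = 4.45

# back-transform from the log scale to the original price scale
pred_price = np.exp(pred_log)
print(round(pred_price, 2))  # roughly 85.6
```

The same caveat applies to the error metrics: the MAE and MAPE reported here are on the log scale, and errors become multiplicative once back-transformed.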
# predictions on the test set
pred = olsmod2.predict(x_test4)
df_pred_test = pd.DataFrame({"Actual": y_test, "Predicted": pred})
df_pred_test.sample(10, random_state=1)
| Actual | Predicted | |
|---|---|---|
| 1995 | 4.566741 | 4.381640 |
| 2341 | 3.696103 | 4.022802 |
| 1913 | 3.592093 | 3.656283 |
| 688 | 4.306495 | 4.115864 |
| 650 | 4.522115 | 5.151518 |
| 2291 | 4.259294 | 4.386796 |
| 40 | 4.997685 | 5.290374 |
| 1884 | 3.875359 | 4.073931 |
| 2538 | 4.206631 | 4.011321 |
| 45 | 5.380450 | 5.287940 |
df2 = df_pred_test.sample(25, random_state=1)
df2.plot(kind="bar", figsize=(15, 7))
plt.show()
# checking model performance on the train set (seen 70% of the data)
print("Training Performance\n")
olsmod2_train_perf = model_performance_regression(olsmod2, x_train4, y_train)
olsmod2_train_perf
Training Performance
| RMSE | MAE | R-squared | Adj. R-squared | MAPE | |
|---|---|---|---|---|---|
| 0 | 0.226351 | 0.177722 | 0.835886 | 0.834998 | 4.176789 |
# checking model performance on the test set (unseen 30% of the data)
print("Test Performance\n")
olsmod2_test_perf = model_performance_regression(olsmod2, x_test4, y_test)
olsmod2_test_perf
Test Performance
| RMSE | MAE | R-squared | Adj. R-squared | MAPE | |
|---|---|---|---|---|---|
| 0 | 0.231239 | 0.181287 | 0.836756 | 0.834682 | 4.265933 |
The model is able to explain ~84% of the variation in the data, which is very good.
The train and test RMSE and MAE are low and comparable. So, our model is not suffering from overfitting.
The MAPE on the test set suggests we can predict within ~4.3% of the used_price.
Hence, we can conclude the model olsmod2 is good for prediction as well as inference purposes.
olsmodel_final = sm.OLS(y_train, x_train4).fit()
print(olsmodel_final.summary())
OLS Regression Results
==============================================================================
Dep. Variable: used_price_log R-squared: 0.836
Model: OLS Adj. R-squared: 0.835
Method: Least Squares F-statistic: 1020.
Date: Wed, 02 Mar 2022 Prob (F-statistic): 0.00
Time: 18:59:36 Log-Likelihood: 161.29
No. Observations: 2417 AIC: -296.6
Df Residuals: 2404 BIC: -221.3
Df Model: 12
Covariance Type: nonrobust
=========================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------
const -35.4270 6.765 -5.237 0.000 -48.693 -22.161
main_camera_mp 0.0190 0.001 14.868 0.000 0.017 0.022
selfie_camera_mp 0.0141 0.001 13.712 0.000 0.012 0.016
ram 0.0127 0.004 3.037 0.002 0.005 0.021
battery 3.168e-05 5.89e-06 5.375 0.000 2.01e-05 4.32e-05
release_year 0.0172 0.003 5.124 0.000 0.011 0.024
new_price_log 0.4084 0.011 37.835 0.000 0.387 0.430
weight_log 0.4995 0.029 17.387 0.000 0.443 0.556
brand_name_BlackBerry -0.1177 0.054 -2.183 0.029 -0.223 -0.012
brand_name_Karbonn 0.1258 0.053 2.381 0.017 0.022 0.229
brand_name_OnePlus 0.1375 0.064 2.148 0.032 0.012 0.263
brand_name_Xiaomi 0.0607 0.025 2.443 0.015 0.012 0.109
4g_yes 0.0337 0.015 2.314 0.021 0.005 0.062
==============================================================================
Omnibus: 122.315 Durbin-Watson: 1.916
Prob(Omnibus): 0.000 Jarque-Bera (JB): 170.281
Skew: -0.467 Prob(JB): 1.06e-37
Kurtosis: 3.904 Cond. No. 5.71e+06
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 5.71e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
main_camera_mp, selfie_camera_mp, ram, battery, release_year, new_price_log, weight_log, brand_name_BlackBerry, brand_name_Karbonn, brand_name_OnePlus, brand_name_Xiaomi, and 4g_yes have a significant influence on used_price_log.
brand_name_BlackBerry is the only predictor with a negative coefficient, i.e., BlackBerry devices fetch lower used prices, all else being equal.